Robust Regression on MapReduce

Authors

  • Xiangrui Meng
  • Michael W. Mahoney
Abstract

Although the MapReduce framework is now the de facto standard for analyzing massive data sets, many algorithms (in particular, many iterative algorithms popular in machine learning, optimization, and linear algebra) are hard to fit into MapReduce. Consider, e.g., the $\ell_p$ regression problem: given a matrix $A \in \mathbb{R}^{m \times n}$ and a vector $b \in \mathbb{R}^m$, find a vector $x^* \in \mathbb{R}^n$ that minimizes $f(x) = \|Ax - b\|_p$. The widely used $\ell_2$ regression, i.e., linear least-squares, is known to be highly sensitive to outliers, and choosing $p \in [1, 2)$ can help improve robustness. In this work, we propose an efficient algorithm for solving strongly over-determined ($m \gg n$) robust $\ell_p$ regression problems to moderate precision on MapReduce. Our empirical results on data up to the terabyte scale demonstrate that our algorithm is a significant improvement over traditional iterative algorithms on MapReduce for $\ell_1$ regression, even for a fairly small number of iterations. In addition, our proposed interior-point cutting-plane method can also be extended to solving more general convex problems on MapReduce.
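
To make the robustness claim concrete, here is a minimal single-machine sketch comparing least-squares (p = 2) with p = 1 regression on data containing outliers. It uses iteratively reweighted least squares (IRLS), a standard textbook technique, and is not the interior-point cutting-plane method or the MapReduce algorithm proposed in the paper. The function name `lp_regression_irls`, the synthetic data, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def lp_regression_irls(A, b, p=1.0, iters=50, eps=1e-8):
    """Approximately minimize ||Ax - b||_p for p in [1, 2] using IRLS.

    Illustrative only: a generic textbook scheme, not the paper's method.
    """
    # Start from the ordinary least-squares solution.
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    for _ in range(iters):
        r = A @ x - b
        # Row weights |r_i|^(p-2); eps guards against division by zero.
        w = np.maximum(np.abs(r), eps) ** (p - 2)
        sw = np.sqrt(w)
        # Solve the weighted least-squares subproblem.
        x = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)[0]
    return x

# Synthetic, strongly over-determined problem (m much larger than n).
rng = np.random.default_rng(0)
m, n = 2000, 3
A = rng.standard_normal((m, n))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true + 0.01 * rng.standard_normal(m)
b[:100] += 50.0  # corrupt 5% of the observations with large outliers

x_l2 = np.linalg.lstsq(A, b, rcond=None)[0]   # p = 2
x_l1 = lp_regression_irls(A, b, p=1.0)        # p = 1

err_l2 = np.linalg.norm(x_l2 - x_true)
err_l1 = np.linalg.norm(x_l1 - x_true)
print(f"l2 error: {err_l2:.4f}, l1 error: {err_l1:.4f}")
```

On this synthetic problem the p = 1 estimate should land much closer to `x_true` than the least-squares estimate, because each outlier's influence is bounded instead of growing with the square of its residual. IRLS needs a full pass over the data per iteration, which is the kind of iterative pattern the abstract describes as hard to fit into MapReduce.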

Similar resources

Support vector regression model for BigData systems

Nowadays Big Data are becoming more and more important. Many sectors of our economy are now guided by data-driven decision processes. Big Data and business intelligence applications are facilitated by the MapReduce programming model while, at the infrastructural layer, cloud computing provides flexible and cost-effective solutions for allocating large clusters on demand. In such systems, capacity a...

Large-Scale Non-Linear Regression within the MapReduce Framework

By: Ahmed Khademzadeh. Thesis Advisor: Philip Chan, Ph.D. Regression models have many applications in real-world problems such as finance, epidemiology, environmental science, etc. Big datasets are everywhere these days, and bigger datasets would help us to construct better models from the data. The issue with big datasets is that...

A robust least squares fuzzy regression model based on kernel function

In this paper, a new approach is presented to fit a robust fuzzy regression model based on some fuzzy quantities. In this approach, we first introduce a new distance between two fuzzy numbers using the kernel function, and then, based on the least squares method, the parameters of the fuzzy regression model are estimated. The proposed approach has a suitable performance to...

Parallel extreme learning machine for regression based on MapReduce

Regression is one of the most basic problems in data mining. For regression problems, the extreme learning machine (ELM) can achieve better generalization performance at a much faster learning speed. However, the growing volume of datasets makes regression by ELM on very large-scale datasets a challenging task. Through analyzing the mechanism of the ELM algorithm, an efficient parallel ELM for regression ...

ROUTE: run-time robust reducer workload estimation for MapReduce

MapReduce has become a popular model for large-scale data processing in recent years. Many works on MapReduce scheduling (e.g., load balancing and deadline-aware scheduling) have emphasized the importance of predicting workload received by individual reducers. However, because the input characteristics and user-specified map function of a given job are unknown to the MapReduce framework before ...


Publication date: 2013